{
    "cells": [
     {
      "cell_type": "markdown",
      "id": "1ae65fec",
      "metadata": {},
      "source": [
       "# Deploy the finetuned stanford_alpaca model on Amazon SageMaker\n",
       "\n",
       "As we have finetuned the model, next we will show you how to deploy the model on SageMaker.\n",
       "\n",
       "In this notebook, we explore how to host a large language model on SageMaker using the [Large Model Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-inference.html) container that is optimized for hosting large models using DJLServing. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent [blog post](https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/)."
      ]
     },
     {
      "cell_type": "markdown",
      "id": "e2407531",
      "metadata": {},
      "source": [
       "## Create a SageMaker Model for Deployment\n",
       "As a first step, we'll import the relevant libraries and configure several global variables such as the hosting image that will be used nd the S3 location of our model artifacts"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "5806a0f1",
      "metadata": {},
      "outputs": [],
      "source": [
       "import sagemaker\n",
       "from sagemaker.model import Model\n",
       "from sagemaker import serializers, deserializers\n",
       "from sagemaker import image_uris\n",
       "import boto3\n",
       "import os\n",
       "import time\n",
       "import json\n",
       "import jinja2\n",
       "from pathlib import Path"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "24862c4a",
      "metadata": {},
      "outputs": [],
      "source": [
       "role = sagemaker.get_execution_role()  # execution role for the endpoint\n",
       "sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs\n",
       "bucket = sess.default_bucket()  # bucket to house artifacts\n",
       "\n",
       "region = sess._region_name # region name of the current SageMaker Studio environment\n",
       "account_id = sess.account_id()  # account_id of the current SageMaker Studio environment\n",
       "\n",
       "s3_client = boto3.client(\"s3\") # client to intreract with S3 API\n",
       "sm_client = boto3.client(\"sagemaker\")  # client to intreract with SageMaker\n",
       "smr_client = boto3.client(\"sagemaker-runtime\") # client to intreract with SageMaker Endpoints\n",
       "jinja_env = jinja2.Environment() # jinja environment to generate model configuration templates"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "38930529",
      "metadata": {},
      "outputs": [],
      "source": [
       "# lookup the inference image uri based on our current region\n",
       "djl_inference_image_uri = (\n",
       "    f\"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.3-cu117\"\n",
       ")"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "d00f2f2b",
      "metadata": {},
      "outputs": [],
      "source": [
       "pretrained_model_location = \"\"# Change to the model artifact path in S3 which we get from the fine tune job\n",
       "print(f\"Pretrained model will be downloaded from ---- > {pretrained_model_location}\")"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "29aec093",
      "metadata": {},
      "source": [
       "## Build the inference contianer image"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "7b814406",
      "metadata": {},
      "outputs": [],
      "source": [
       "%%writefile Dockerfile.inference\n",
       "## You should change below region code to the region you used, here sample is use us-west-2\n",
       "From 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.3-cu117 \n",
       "\n",
       "ENV LANG=C.UTF-8\n",
       "ENV PYTHONUNBUFFERED=TRUE\n",
       "ENV PYTHONDONTWRITEBYTECODE=TRUE\n",
       "\n",
       "## Install transfomers version which support LLaMaTokenizer\n",
       "RUN python3 -m pip install git+https://github.com/huggingface/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176\n",
       "\n",
       "## Make all local GPUs visible\n",
       "ENV NVIDIA_VISIBLE_DEVICES=\"all\""
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "465b7a51",
      "metadata": {},
      "outputs": [],
      "source": [
       "## You should change below region code to the region you used, here sample is use us-west-2\n",
       "!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "351f761d",
      "metadata": {},
      "outputs": [],
      "source": [
       "## define repo name, should contain *sagemaker* in the name\n",
       "repo_name = \"sagemaker-alpaca-inference-demo\""
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "4a58a7ce",
      "metadata": {},
      "outputs": [],
      "source": [
       "%%script env repo_name=$repo_name bash\n",
       "\n",
       "#!/usr/bin/env bash\n",
       "\n",
       "# This script shows how to build the Docker image and push it to ECR to be ready for use\n",
       "# by SageMaker.\n",
       "\n",
       "# The argument to this script is the image name. This will be used as the image on the local\n",
       "# machine and combined with the account and region to form the repository name for ECR.\n",
       "# The name of our algorithm\n",
       "algorithm_name=${repo_name}\n",
       "\n",
       "account=$(aws sts get-caller-identity --query Account --output text)\n",
       "\n",
       "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n",
       "region=$(aws configure get region)\n",
       "region=${region:-us-west-2}\n",
       "\n",
       "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest\"\n",
       "\n",
       "# If the repository doesn't exist in ECR, create it.\n",
       "aws ecr describe-repositories --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n",
       "\n",
       "if [ $? -ne 0 ]\n",
       "then\n",
       "    aws ecr create-repository --repository-name \"${algorithm_name}\" > /dev/null\n",
       "fi\n",
       "\n",
       "# Get the login command from ECR and execute it directly\n",
       "aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}\n",
       "\n",
       "# Build the docker image locally with the image name and then push it to ECR\n",
       "# with the full name.\n",
       "\n",
       "docker build -t ${algorithm_name} -f Dockerfile.inference .\n",
       "docker tag ${algorithm_name} ${fullname}\n",
       "\n",
       "docker push ${fullname}"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "84497325",
      "metadata": {},
      "outputs": [],
      "source": [
       "## The image uri which is build and pushed above\n",
       "inference_image_uri = \"{}.dkr.ecr.{}.amazonaws.com/{}:latest\".format(account_id, region, repo_name)\n",
       "inference_image_uri"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "01e33506",
      "metadata": {},
      "source": [
       "## Deploying a Large Language Model using Hugging Face Accelerate\n",
       "The DJL Inference Image which we will be utilizing ships with a number of built-in inference handlers for a wide variety of tasks including:\n",
       "- `text-generation`\n",
       "- `question-answering`\n",
       "- `text-classification`\n",
       "- `token-classification`\n",
       "\n",
       "You can refer to this [GitRepo](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python/setup/djl_python) for a list of additional handlers and available NLP Tasks. <br>\n",
       "These handlers can be utilized as is without having to write any custom inference code. We simply need to create a `serving.properties` text file with our desired hosting options and package it up into a `tar.gz` artifact.\n",
       "\n",
       "Lets take a look at the `serving.properties` file that we'll be using for our first example"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "e3570119",
      "metadata": {},
      "outputs": [],
      "source": [
       "!mkdir accelerate_src"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "33c253cd",
      "metadata": {},
      "outputs": [],
      "source": [
       "%%writefile accelerate_src/serving.template\n",
       "engine=Python\n",
       "option.entryPoint=djl_python.huggingface\n",
       "option.s3url={{ s3url }}\n",
       "option.task=text-generation\n",
       "option.device_map=auto\n",
       "option.load_in_8bit=TRUE"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "1bbdefb3",
      "metadata": {},
      "outputs": [],
      "source": [
       "# we plug in the appropriate model location into our `serving.properties` file based on the region in which this notebook is running\n",
       "template = jinja_env.from_string(Path(\"accelerate_src/serving.template\").open().read())\n",
       "Path(\"accelerate_src/serving.properties\").open(\"w\").write(template.render(s3url=pretrained_model_location))\n",
       "!pygmentize accelerate_src/serving.properties | cat -n"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "e46cffa8",
      "metadata": {},
      "source": [
       "There are a few options specified here. Lets go through them in turn<br>\n",
       "1. `engine` - specifies the engine that will be used for this workload. In this case we'll be hosting a model using the [DJL Python Engine](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python)\n",
       "2. `option.entryPoint` - specifies the entrypoint code that will be used to host the model. djl_python.huggingface refers to the `huggingface.py` module from [djl_python repo](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python/setup/djl_python).  \n",
       "3. `option.s3url` - specifies the location of the model files. Alternativelly an `option.model_id` option can be used instead to specifiy a model from Hugging Face Hub (e.g. `EleutherAI/gpt-j-6B`) and the model will be automatically downloaded from the Hub. The s3url approach is recommended as it allows you to host the model artifact within your own environment and enables faster deployments by utilizing optimized approach within the DJL inference container to transfer the model from S3 into the hosting instance \n",
       "4. `option.task` - This is specific to the `huggingface.py` inference handler and specifies for which task this model will be used\n",
       "5. `option.device_map` - Enables layer-wise model partitioning through [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map). With `option.device_map=auto`, Accelerate will determine where to put each **layer** to maximize the use of your fastest devices (GPUs) and offload the rest on the CPU, or even the hard drive if you don’t have enough GPU RAM (or CPU RAM). Even if the model is split across several devices, it will run as you would normally expect.\n",
       "6. `option.load_in_8bit` - Quantizes the model weights to int8 thereby greatly reducing the memory footprint of the model from the initial FP32. See this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) from Hugging Face for additional information \n",
       "\n",
       "For more information on the available options, please refer to the [SageMaker Large Model Inference Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html)\n",
       "\n",
       "Our initial approach here is to utilize the built-in functionality within Hugging Face Transformers to enable Large Language Model hosting. "
      ]
     },
     {
      "cell_type": "markdown",
      "id": "d156470a",
      "metadata": {},
      "source": [
       "We place the `serving.properties` file into a tarball and upload it to S3"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "12371518",
      "metadata": {},
      "outputs": [],
      "source": [
       "!tar czvf acc_model.tar.gz accelerate_src/ "
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "3098668f",
      "metadata": {},
      "outputs": [],
      "source": [
       "s3_code_prefix = \"llama/deploy/code\"\n",
       "\n",
       "code_artifact = sess.upload_data(\"acc_model.tar.gz\", bucket, s3_code_prefix)\n",
       "print(f\"S3 Code or Model tar ball uploaded to --- > {code_artifact}\")"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "e0807c58",
      "metadata": {},
      "source": [
       "## Deploy Model to a SageMaker Endpoint\n",
       "With a helper function we can now deploy our endpoint and invoke it with some sample inputs"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "c26dd42b",
      "metadata": {},
      "outputs": [],
      "source": [
       "def deploy_model(image_uri, model_data, role, endpoint_name, instance_type, sagemaker_session):\n",
       "    \n",
       "    \"\"\"Helper function to create the SageMaker Endpoint resources and return a predictor\"\"\"\n",
       "    model = Model(\n",
       "            image_uri=image_uri, \n",
       "              model_data=model_data, \n",
       "              role=role\n",
       "             )\n",
       "    \n",
       "    model.deploy(\n",
       "        initial_instance_count=1,\n",
       "        instance_type=instance_type,\n",
       "        endpoint_name=endpoint_name\n",
       "        )\n",
       "    \n",
       "    # our requests and responses will be in json format so we specify the serializer and the deserializer\n",
       "    predictor = sagemaker.Predictor(\n",
       "        endpoint_name=endpoint_name, \n",
       "        sagemaker_session=sagemaker_session, \n",
       "        serializer=serializers.JSONSerializer(), \n",
       "        deserializer=deserializers.JSONDeserializer())\n",
       "    \n",
       "    return predictor"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "d75889a6",
      "metadata": {},
      "outputs": [],
      "source": [
       "# creates a unique endpoint name\n",
       "endpoint_name = sagemaker.utils.name_from_base(\"alpaca-7B\")\n",
       "print(f\"Our endpoint will be called {endpoint_name}\")"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "a2bf4ee4",
      "metadata": {},
      "outputs": [],
      "source": [
       "# deployment will take about 10 minutes\n",
       "predictor = deploy_model(image_uri=inference_image_uri, \n",
       "                            model_data=code_artifact, \n",
       "                            role=role, \n",
       "                            endpoint_name=endpoint_name, \n",
       "                            instance_type=\"ml.g4dn.4xlarge\", \n",
       "                            sagemaker_session=sess)"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "b3c19a0c",
      "metadata": {},
      "source": [
       "Let's run an example with a basic text generation prompt Large model inference is"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "43611bc8",
      "metadata": {},
      "outputs": [],
      "source": [
       "predictor.predict({ \n",
       "                    \"inputs\" : \"Large model inference is\", \n",
       "                    \"parameters\": { \"max_length\": 50, \"temperature\": 0.5 }\n",
       "                })"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "5d5f35f7",
      "metadata": {},
      "outputs": [],
      "source": [
       "print(predictor.predict({ \"inputs\":\n",
       "                                \"\"\"Message: Support has been terrible for 2 weeks...\n",
       "                                Sentiment: Negative\n",
       "                                ###\n",
       "                                Message: I love your API, it is simple and so fast!\n",
       "                                Sentiment: Positive\n",
       "                                ###\n",
       "                                Message: GPT-J has been released 12 months ago.\n",
       "                                Sentiment: Neutral\n",
       "                                ###\n",
       "                                Message: The responsiveness of your team has been amazing, thank you so much!\n",
       "                                Sentiment:\"\"\",\n",
       "                      \"parameters\": { \"max_length\": 50, \"temperature\": 0.5 }\n",
       "                     }\n",
       "                    )[0][\"generated_text\"])"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "e033f0aa",
      "metadata": {},
      "source": [
       "Finally Let's do a quick benchmark to see what kind of latency we can expect from this model"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "61464f7d",
      "metadata": {},
      "outputs": [],
      "source": [
       "%%timeit -n3 -r1\n",
       "predictor.predict({ \n",
       "                    \"inputs\" : \"Large model inference is\", \n",
       "                    \"parameters\": { \"max_length\": 50, \"temperature\": 0.5 }\n",
       "                })"
      ]
     },
     {
      "cell_type": "code",
      "execution_count": null,
      "id": "a2da0c7d",
      "metadata": {},
      "outputs": [],
      "source": [
       "# Clean up the endpoint before proceeding\n",
       "predictor.delete_endpoint()"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "7377c84c",
      "metadata": {},
      "source": [
       "## Reference"
      ]
     },
     {
      "cell_type": "markdown",
      "id": "d1a842b9",
      "metadata": {},
      "source": [
       "[sagemaker-hosting/Large-Language-Model-Hosting/](https://github.com/aws-samples/sagemaker-hosting/tree/main/Large-Language-Model-Hosting)"
      ]
     }
    ],
    "metadata": {
     "kernelspec": {
      "display_name": "conda_pytorch_p39",
      "language": "python",
      "name": "conda_pytorch_p39"
     },
     "language_info": {
      "codemirror_mode": {
       "name": "ipython",
       "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.15"
     }
    },
    "nbformat": 4,
    "nbformat_minor": 5
   }